Semi-supervised natural language acquisition

Authors

  • Roi Reichart
  • Raanan Fattal
  • Amir Globerson
  • Mark Sammons
  • Vivek Srikumar
  • Katrin Tomanek
Abstract

Natural language processing (NLP) is a field that combines linguistics, cognitive science, statistical machine learning and other areas of computer science in order to build intelligent computer systems that can understand human languages. NLP has various applications, among them machine translation, question answering and search engines. Over the past two decades, NLP has come to simultaneously rely on and challenge the field of machine learning. Statistical methods now dominate NLP and have moved the field forward substantially, opening up new possibilities for exploiting data in the development of NLP components and applications. Many state-of-the-art natural language algorithms are based on supervised learning techniques. In this type of learning, a corpus of texts annotated by human experts is compiled and used to train a learning algorithm. While supervised learning has made substantial contributions to NLP, it faces some significant challenges. Many fundamental NLP tasks, such as syntactic parsing, part-of-speech (POS) tagging and machine translation, involve structured prediction and sequential labeling. For such tasks, compiling annotated corpora is costly and error-prone due to the complex nature of the annotation. I refer to this challenge as the annotation bottleneck. A closely related challenge is that of domain adaptation. Supervised algorithms usually perform well when the training data and the data for which they should provide predictions (the test data) are drawn from similar domains. When an algorithm trained on data from one domain has to provide predictions for data from a substantially different domain, its performance degrades markedly. Creating corpora for every test domain is not feasible due to the aforementioned annotation bottleneck.
Supervised natural language learning is also challenged by methodological problems. Annotation schemes for tasks such as syntactic parsing are often based on arbitrary decisions. Such schemes often provide a detailed description of certain structures while addressing others only briefly. Many applications would benefit from different annotation decisions. This work focuses on developing machine learning techniques that deal with these challenges. Its main theme is using the plentiful raw text available today to create state-of-the-art algorithms that require little to no manually annotated data. Cases where small amounts of manually annotated text are used are referred to as semi-supervised learning, while cases where only raw text is used are known as unsupervised learning. This work explores semi-supervised and unsupervised techniques for structured

Similar resources

Data-Driven Graph Construction for Semi-Supervised Graph-Based Learning in NLP

Graph-based semi-supervised learning has recently emerged as a promising approach to data-sparse learning problems in natural language processing. All graph-based algorithms rely on a graph that jointly represents labeled and unlabeled data points. The problem of how to best construct this graph remains largely unsolved. In this paper we introduce a data-driven method that optimizes the represe...

Special semi-supervised techniques for Natural Language Processing tasks

A labeled natural language corpus is often difficult, expensive or time-consuming to obtain, as its construction requires expert human effort. On the other hand, unlabeled texts are available in abundance thanks to the World Wide Web. The importance of utilizing unlabeled data in machine learning systems is growing. Here, we investigate classic semi-supervised approaches and examine the potenti...

Semi-supervised Classification for Natural Language Processing

Semi-supervised classification is an approach in which classification models are learned from both labeled and unlabeled data. It has several advantages over supervised classification in the natural language processing domain. For instance, supervised classification exploits only labeled data, which are expensive, often difficult to get, inadequate in quantity, and require human experts for anno...

Scalable Graph-Based Learning Applied to Human Language Technology

Andrei Alexandrescu. Chair of the Supervisory Committee: Associate Research Professor Katrin Kirchhoff, Electrical Engineering. Graph-based semi-supervised learning techniques have recently attracted increasing attention as a means to utilize unlabeled data in machine learning by placing data points in a similarity graph. However, ...

Parsing Natural Language Sentences by Semi-supervised Methods

We present our work on semi-supervised parsing of natural language sentences, focusing on multi-source crosslingual transfer of delexicalized dependency parsers. We first evaluate the influence of treebank annotation styles on parsing performance, focusing on adposition attachment style. Then, we present KLcpos3, an empirical language similarity measure, designed and tuned for source parser we...

Invited Talk: Domain-adaptation of Natural Language Processing Tools for RE

Natural language processing tools like part-of-speech taggers and parsers are being used in a variety of applications involving natural language, including RE. Such tools, based on statistical models of language, are learnt via supervised machine learning algorithms from human-annotated data. Due to their dependence on annotated data, which is limited in size and genre, these models have a fall...

Publication date: 2011